Synthesizing Union Tables from the Web

Authors

  • Xiao Ling
  • Alon Y. Halevy
  • Fei Wu
  • Cong Yu
Abstract

Several recent works have focused on harvesting HTML tables from the Web and recovering their semantics [Cafarella et al., 2008a; Elmeleegy et al., 2009; Limaye et al., 2010; Venetis et al., 2011]. As a result, users can now explore hundreds of millions of high-quality structured data tables. In this paper, we argue that those efforts only scratch the surface of the true value of structured data on the Web, and we study the challenging problem of synthesizing tables from the Web, i.e., producing never-before-seen tables from raw tables on the Web. Table synthesis offers an important semantic advantage: when a set of related tables is combined into a single union table, powerful mechanisms, such as temporal or geographical comparison and visualization, can be employed to understand and mine the underlying data holistically. We focus on one fundamental task of table synthesis, namely, table stitching. Within a given site, tables with identical schemas can be scattered across many pages. The task of table stitching involves combining such tables into a single meaningful union table and identifying extra attributes and values for its rows so that rows from different original tables can be distinguished. Specifically, we first define the notion of stitchable tables and identify collections of tables that can be stitched. Second, we design an effective algorithm for extracting hidden attributes that are essential for the stitching process and for aligning values of those attributes across tables to synthesize new columns. We also assign meaningful names to these synthesized columns. Experiments on real-world tables demonstrate the effectiveness of our approach.
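
The stitching idea in the abstract can be illustrated with a minimal Python sketch. This is not the paper's algorithm. It assumes the hidden attributes have already been extracted for each table (the hypothetical `context` field) and only shows the final union step: tables with identical headers are grouped, their rows are concatenated, and each hidden attribute becomes a new column so that rows from different source tables stay distinguishable. The name `stitch_tables` and the dictionary layout are assumptions made for illustration.

```python
from collections import defaultdict


def stitch_tables(tables):
    """Union tables that share an identical header, adding context columns.

    Each table is a dict with (hypothetical) keys:
      "header":  list of column names
      "rows":    list of rows, each a list of cell values
      "context": dict of hidden attributes already extracted for the table
    """
    # Group tables whose headers are exactly identical.
    groups = defaultdict(list)
    for table in tables:
        groups[tuple(table["header"])].append(table)

    stitched = []
    for header, group in groups.items():
        # Collect hidden attribute names in first-seen order.
        extra_columns = []
        for table in group:
            for name in table["context"]:
                if name not in extra_columns:
                    extra_columns.append(name)

        # Append each table's context values to every one of its rows;
        # a table lacking an attribute gets None in that column.
        rows = []
        for table in group:
            context_values = [table["context"].get(name) for name in extra_columns]
            for row in table["rows"]:
                rows.append(list(row) + context_values)

        stitched.append({"header": list(header) + extra_columns, "rows": rows})
    return stitched
```

In this sketch, two tables with header `["name", "value"]` and contexts `{"year": "2010"}` and `{"year": "2011"}` would be merged into one table with header `["name", "value", "year"]`. Extracting and aligning the hidden attributes, which the abstract identifies as the core of the work, is deliberately left out.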

Similar resources

Markup-Agnostic Table Cell Extraction

Tables are very commonly used to present relational data. This report focuses on mining structured data from markup language specified tables. Table recognition, table interpretation and presentation of results are discussed. First, two categories of features are developed to recognize genuine tables. These recognized tables provide knowledge of table types we need in order to synthesize tables ...

How to read (and understand) Volume A of International Tables for Crystallography: an introduction for nonspecialists

Copyright © International Union of Crystallography. Author(s) of this paper may load this reprint on their own web site or institutional repository provided that this cover page is retained. Republication of this article or its storage in electronic databases other than as specified above is not permitted without prior permission in writing from the IUCr. For further information see http://jou...

Untangling the Web from DNS

The Web relies on the Domain Name System (DNS) to resolve the hostname portion of URLs into IP addresses. This marriage-of-convenience enabled the Web’s meteoric rise, but the resulting entanglement is now hindering both infrastructures—the Web is overly constrained by the limitations of DNS, and DNS is unduly burdened by the demands of the Web. There has been much commentary on this sad state-...

Overlap of respiratory system articles in the two databases Scopus and Web of Science: a short report

Background: Because overlap in subject and content between databases leads to duplicate purchases and wasted resources, this study examined the degree of overlap between respiratory system papers indexed in Scopus and Web of Science during the years 2001 to 2010. Methods: In this survey study, researcher followed by obtaining percent overlap i...

Stitching Web Tables for Improving Matching Quality

HTML tables on web pages (“web tables”) cover a wide variety of topics. Data from web tables can thus be useful for tasks such as knowledge base completion or ad hoc table extension. Before table data can be used for these tasks, the tables must be matched to the respective knowledge base or base table. The challenges of web table matching are the high heterogeneity and the small size of the ta...

Publication date: 2013